
Practical Issues in Neural Network Training - Regularization


Since a larger number of parameters causes overfitting, a natural approach is to constrain the model to use fewer non-zero parameters. In the previous example, if we constrain the vector W̄ to have only one non-zero component out of five components, it will correctly obtain the solution [2,0,0,0,0]. Smaller absolute values of the parameters also tend to overfit less. Since it is hard to constrain the values of the parameters directly, the softer approach of adding the penalty λ||W̄||^p to the loss function is used. The value of p is typically set to 2, which leads to Tikhonov regularization. In general, the squared value of each parameter (multiplied by the regularization parameter λ > 0) is added to the objective function. The practical effect of this change is that a quantity proportional to λWi is subtracted from the update of the parameter Wi. An example of a regularized version of Equation 1.6 for mini-batch S and update step-size α > 0 is as follows:

W̄ ⇐ W̄(1 − α·λ) + α Σ_{X̄∈S} E[X̄]·X̄     (1.33)
Here, E[X̄] represents the current error (y − ŷ) between the observed and predicted values of training instance X̄. One can view this type of penalization as a kind of weight decay during the updates. Regularization is particularly important when the amount of available data is limited. A neat biological interpretation of regularization is that it corresponds to gradual forgetting, as a result of which "less important" (i.e., noisy) patterns are removed. In general, it is often advisable to use more complex models with regularization rather than simpler models without regularization.
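The weight-decay update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the names (regularized_update, alpha, lam) and the toy target y = 2·x1 are assumptions chosen to mirror the [2,0,0,0,0] example from the text.

```python
import numpy as np

def regularized_update(W, batch, alpha, lam):
    """One mini-batch step: W <= W*(1 - alpha*lam) + alpha * sum E[X] * X."""
    W = W * (1.0 - alpha * lam)        # weight decay: shrink every parameter
    for X, y in batch:
        y_hat = np.dot(W, X)           # model prediction
        error = y - y_hat              # E[X] = (y - y_hat)
        W = W + alpha * error * X      # error-driven correction
    return W

# Toy usage: data generated by y = 2 * x1, so the ideal weights
# are approximately [2, 0, 0, 0, 0].
rng = np.random.default_rng(0)
W = np.zeros(5)
for _ in range(200):
    X = rng.normal(size=5)
    y = 2.0 * X[0]
    W = regularized_update(W, [(X, y)], alpha=0.05, lam=0.01)
```

After training, W[0] is close to 2 (slightly shrunk by the decay term) while the remaining components stay near zero, which is exactly the pruning-like effect the penalty is meant to produce.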


As a side note, the general form of Equation 1.33 is used by many regularized machine learning models like least-squares regression, where E[X̄] is replaced by the error function of that specific model. Interestingly, weight decay is only sparingly used in the single-layer perceptron because it can sometimes cause overly rapid forgetting, with a small number of recently misclassified training points dominating the weight vector; the main issue is that the perceptron criterion is already a degenerate loss function with a minimum value of 0 at W̄ = 0 (unlike its hinge-loss or least-squares cousins).
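The degeneracy at W̄ = 0 is easy to verify numerically. The sketch below, with an illustrative two-point data set of my own choosing, evaluates the perceptron criterion max(0, −y·(W̄·X̄)) and shows that the zero vector already attains the minimum loss of 0, so a decay term that pulls weights toward zero is pulling them toward a trivial "optimum".

```python
import numpy as np

def perceptron_criterion(W, data):
    """Perceptron criterion: sum of max(0, -y * (W . X)) over the data."""
    return sum(max(0.0, -y * np.dot(W, X)) for X, y in data)

# A tiny linearly separable data set (illustrative, not from the text).
data = [(np.array([1.0, 2.0]), 1),
        (np.array([-2.0, -1.0]), -1)]

W_zero = np.zeros(2)              # the degenerate solution
W_sep = np.array([1.0, 1.0])      # a genuinely separating weight vector
W_bad = np.array([-1.0, -1.0])    # misclassifies both points

print(perceptron_criterion(W_zero, data))  # 0.0: degenerate minimum
print(perceptron_criterion(W_sep, data))   # 0.0: correct separator
print(perceptron_criterion(W_bad, data))   # positive: real mistakes
```

Both the zero vector and the correct separator score 0, so the loss alone cannot distinguish "classifies everything correctly" from "predicts nothing"; hinge loss and least-squares loss do not share this flaw.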


This quirk is a legacy of the fact that the single-layer perceptron was originally defined in terms of biologically inspired updates rather than in terms of carefully thought-out loss functions. Convergence to an optimal solution was never guaranteed other than in linearly separable cases. For the single-layer perceptron, other regularization techniques, which will be discussed in the coming posts, are more commonly used.